PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Ma, Liang, Wen, Jiajun, Lin, Min, Xu, Rongtao, Liang, Xiwen, Lin, Bingqian, Ma, Jun, Wang, Yongxin, Wei, Ziming, Lin, Haokun, Han, Mingfei, Cao, Meng, Chen, Bokui, Laptev, Ivan, Liang, Xiaodan
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive-hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining markedly as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvement, suggesting that spatial tasks rely heavily on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
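The partial-completion dimension described above can be illustrated with a minimal scorer. The step encoding `(block_id, position)` and the exact-match rule are illustrative assumptions for this sketch, not PhyBlock's actual metric.

```python
# Hedged sketch of a partial-completion score: the fraction of target
# block placements that also appear in the model's executed plan.
# The (block_id, grid_position) tuple encoding is an assumption.

def partial_completion(executed, target):
    """Return the fraction of target placements matched by execution."""
    if not target:
        return 1.0
    matched = sum(1 for placement in target if placement in executed)
    return matched / len(target)

# Hypothetical three-block target; the model placed two blocks correctly.
target_plan = [("red_cube", (0, 0)), ("blue_arch", (0, 1)), ("roof", (0, 2))]
executed_plan = [("red_cube", (0, 0)), ("roof", (0, 2))]
score = partial_completion(executed_plan, target_plan)
```

A metric like this rewards partially correct assemblies instead of scoring them as outright failures, which is what distinguishes partial completion from a binary success rate.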
VLM-driven Skill Selection for Robotic Assembly Tasks
Kim, Jeong-Jung, Koh, Doo-Yeol, Kim, Chang-Hyun
Robotic assembly tasks represent one of the most challenging problems in robotics, requiring precise manipulation capabilities combined with sophisticated reasoning about complex multi-step processes. Unlike simple pick-and-place tasks, assembly tasks demand long-term planning that spans multiple sequential actions, where each step must be carefully coordinated with previous and subsequent operations. Furthermore, these tasks require physical understanding of component interactions and spatial relationships between parts [1], [2], [3]. Vision-Language Models (VLMs) have emerged as powerful tools that bridge visual perception and high-level reasoning, offering significant advantages for robotic applications. These models excel at processing visual information while understanding natural language instructions, making them well-suited for complex manipulation tasks.
Rectified Point Flow: Generic Point Cloud Pose Estimation
Sun, Tao, Zhu, Liyuan, Huang, Shengyu, Song, Shuran, Armeni, Iro
We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: https://rectified-pointflow.github.io/.
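The velocity-field transport above can be sketched in a few lines. Here the ideal straight-line field, which rectified-flow training regresses toward, stands in for the learned network; this is an illustrative simplification, not the paper's implementation.

```python
# Hedged sketch: transport noisy source points toward target positions
# by forward-Euler integration of a velocity field. The "learned" field
# is replaced by the straight-line velocity v = x1 - x0 that rectified
# flow regresses toward during training.

def euler_transport(x0, x1, steps=20):
    """Integrate each point from x0 toward x1 with forward Euler."""
    x = [list(p) for p in x0]
    dt = 1.0 / steps
    for _ in range(steps):
        for i, (src, tgt) in enumerate(zip(x0, x1)):
            v = [t - s for s, t in zip(src, tgt)]  # straight-line field
            x[i] = [c + dt * vc for c, vc in zip(x[i], v)]
    return x

noisy = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]      # unposed input points
target = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]     # assembled positions
moved = euler_transport(noisy, target)
```

With the ideal field the integration lands exactly on the targets; in the actual method the field is a network conditioned on the unposed point clouds, and part poses are recovered from the transported points.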
VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning
Huang, Binghao, Xu, Jie, Akinola, Iretiayo, Yang, Wei, Sundaralingam, Balakumar, O'Flaherty, Rowland, Fox, Dieter, Wang, Xiaolong, Mousavian, Arsalan, Chao, Yu-Wei, Li, Yunzhu
Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at https://binghao-huang.github.io/vt_refine/.
Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
Tie, Chenrui, Sun, Shengxiang, Lin, Yudi, Wang, Yanbo, Li, Zhongrui, Zhong, Zhouhan, Zhu, Jinxuan, Pang, Yiman, Chen, Haonan, Chen, Junting, Wu, Ruihai, Shao, Lin
Assembly hinges on reliably forming connections between parts, yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the critical "last mile" of assembly execution: while task planning may sequence operations and motion planning may position parts, the precise establishment of physical connections ultimately determines assembly success or failure. In this paper, we consider connections as first-class primitives in assembly representation, including connector types, specifications, quantities, and placement locations. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence.
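The connector-aware graph described above might be encoded as follows. The part names, connector fields, and helper function are illustrative assumptions, not Manual2Skill++'s actual schema.

```python
# Hedged sketch: an assembly graph whose nodes are parts/sub-assemblies
# and whose edges carry explicit connector information (type, spec,
# quantity), treating connections as first-class primitives.

assembly_graph = {
    "nodes": ["leg_front", "leg_back", "tabletop"],
    "edges": [
        {"parts": ("leg_front", "tabletop"),
         "connector": {"type": "cam_lock", "spec": "M6", "quantity": 2}},
        {"parts": ("leg_back", "tabletop"),
         "connector": {"type": "wood_dowel", "spec": "8mm", "quantity": 4}},
    ],
}

def bill_of_connectors(graph):
    """Aggregate connector quantities by (type, spec) across all edges."""
    bill = {}
    for edge in graph["edges"]:
        c = edge["connector"]
        key = (c["type"], c["spec"])
        bill[key] = bill.get(key, 0) + c["quantity"]
    return bill

bom = bill_of_connectors(assembly_graph)
```

Making connectors explicit edge attributes, rather than an afterthought of pose planning, is what lets the downstream pipeline verify that every required connection is actually established.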
Refinery: Active Fine-tuning and Deployment-time Optimization for Contact-Rich Policies
Tang, Bingjie, Akinola, Iretiayo, Xu, Jie, Wen, Bowen, Fox, Dieter, Sukhatme, Gaurav S., Ramos, Fabio, Gupta, Abhishek, Narang, Yashraj
Simulation-based learning has enabled policies for precise, contact-rich tasks (e.g., robotic assembly) to reach high success rates (~80%) under high levels of observation noise and control error. Although such performance may be sufficient for research applications, it falls short of industry standards and makes policy chaining exceptionally brittle. A key limitation is the high variance in individual policy performance across diverse initial conditions. We introduce Refinery, an effective framework that bridges this performance gap, robustifying policy performance across initial conditions. We propose Bayesian Optimization-guided fine-tuning to improve individual policies, and Gaussian Mixture Model-based sampling during deployment to select initializations that maximize execution success. Using Refinery, we improve mean success rates by 10.98% over state-of-the-art methods in simulation-based learning for robotic assembly, reaching 91.51% in simulation and comparable performance in the real world. Furthermore, we demonstrate that these fine-tuned policies can be chained to accomplish long-horizon, multi-part assembly, successfully assembling up to 8 parts without requiring explicit multi-step training.
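The deployment-time sampling step can be sketched with a toy one-dimensional mixture. The component weights, means, and their interpretation as success estimates below are illustrative assumptions, not values from the paper.

```python
import random

# Hedged sketch in the spirit of Refinery's deployment-time GMM sampling:
# draw initial pose offsets from a Gaussian mixture whose component
# weights reflect estimated execution success, so high-success regions
# of the initialization space are sampled more often.

def sample_initialization(components, rng):
    """Draw one offset from a 1-D Gaussian mixture.

    components: list of (weight, mean, std) tuples; weights sum to 1.
    """
    r = rng.random()
    for weight, mu, sigma in components:
        r -= weight
        if r <= 0:
            return rng.gauss(mu, sigma)
    return rng.gauss(components[-1][1], components[-1][2])

rng = random.Random(0)
# (estimated success weight, mean offset in meters, std) -- toy values
components = [(0.8, 0.0, 0.01), (0.2, 0.05, 0.02)]
samples = [sample_initialization(components, rng) for _ in range(1000)]
```

Because the first component carries most of the weight and sits near the nominal (high-success) initialization, the bulk of sampled starts cluster there, which is the mechanism by which deployment-time sampling raises mean execution success.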